On cherche à étudier l’effet de trois facteurs sur le transcriptome des racines d’Arabidopsis. On cherche ici à utiliser et comparer différents packages de multiDE sur nos données.
data <- read.csv("quantifFiles/quantifGenes.csv", h = T, sep = ",")
rownames(data) <- data$Gene
genes = which(!(grepl("__", rownames(data))))
not_quant = data[which((grepl("__", rownames(data)))), ]
data = data[genes, grepl("R", colnames(data))]
head(data) R3 R2 R1 R5 R4 R6 R7 R11 R10 R12 R13 R17 R16 R14
AT1G01010 1526 1006 1116 1275 967 1018 854 1132 1294 1364 2325 2113 2193 2564
AT1G01020 416 285 289 349 364 370 226 513 502 561 461 407 432 614
AT1G01030 31 15 19 29 36 28 12 47 34 47 18 40 32 44
R15 R18 R24 R19 R22 R23 R21 R20 R9 R8
AT1G01010 2364 2074 1987 2027 1754 1697 1537 1898 816 912
AT1G01020 380 502 484 467 426 415 413 462 223 312
AT1G01030 37 27 42 39 36 40 37 37 15 19
[ reached 'max' / getOption("max.print") -- omitted 3 rows ]
[1] 28775 24
annot <- read.csv("Code_for_RNAseq_CO2_N_Fr.csv", h = T, sep = ";")
conditions <- as.vector(unique(annot$Sample))
annot$ID <- paste0("R", annot$Code)
annot$condition <- substr(conditions, 1, nchar(conditions) - 1)
cond <- unique(substr(conditions, 1, nchar(conditions) - 1))
cond <- cond[grepl("At", cond)]
getCondition <- function(id) {
return(annot[annot$ID == id, "condition"])
}
getExactCondition <- function(id) {
return(annot[annot$ID == id, "Sample"])
}
getLabel <- function(id) {
text <- as.vector(annot[annot$ID == id, "Sample"])
res <- ""
nb <- substr(text, nchar(text), nchar(text))
if (grepl("Ambient", text)) {
res = paste0(res, "c")
} else {
res = paste0(res, "C")
}
if (grepl("High", text)) {
res = paste0(res, "N")
} else {
res = paste0(res, "n")
}
if (grepl("Starv", text)) {
res = paste0(res, "f")
} else {
res = paste0(res, "F")
}
res = paste0(res, "_", nb)
return(res)
}
getLabel("R6")[1] "cnF_3"
[1] At_AmbientCO2_LowNitrate_Fe1
48 Levels: At_AmbientCO2_HighNitrate_Fe1 ... Sl_ElevatedCO2_LowNitrate_FeStarvation3
TCC is an R package that provides a series of functions for differential expression analysis of tag count data. The package incorporates multi-step normalization methods, whose strategy is to remove potential DEGs before performing the data normalization. The normalization function based on this DEG elimination strategy (DEGES) includes (i) the original TbT method based on DEGES for two-group data with or without replicates, (ii) much faster methods for two-group data with or without replicates, and (iii) methods for multi-group comparison. TCC provides a simple unified interface to perform such analyses with combinations of functions provided by edgeR, DESeq, and baySeq.
On crée l’objet TCC avec le design souhaité, et on filtre les gènes avec de faibles expressions (paramètre low.count).
Lors de la normalisation (DEGES,iedgeR), on fait un premier calcul des gènes DE, pour pouvoir les enlever lors de la normalisation. Le maramètre test.method permet de choisir la manière de détecter les genes DE (edgeR, DEsqe2, ou tBt (très long)). On peut répéter cette procédure jusqu’à la convergence des facteurs de taille des librairies, d’ou le i.
# design
keep <- rowSums(data) >= 10
data <- data[keep, ]
group <- sapply(colnames(data), getCondition)
colnames(data) <- sapply(colnames(data), getLabel)
tcc <- new("TCC", data, group)
tccCount:
cNF_3 cNF_2 cNF_1 cnF_2 cnF_1 cnF_3 CNF_1 CnF_2 CnF_1 CnF_3 cNf_1
AT1G01010 1526 1006 1116 1275 967 1018 854 1132 1294 1364 2325
AT1G01020 416 285 289 349 364 370 226 513 502 561 461
AT1G01030 31 15 19 29 36 28 12 47 34 47 18
cnf_2 cnf_1 cNf_2 cNf_3 cnf_3 Cnf_3 CNf_1 Cnf_1 Cnf_2 CNf_3 CNf_2
AT1G01010 2113 2193 2564 2364 2074 1987 2027 1754 1697 1537 1898
AT1G01020 407 432 614 380 502 484 467 426 415 413 462
AT1G01030 40 32 44 37 27 42 39 36 40 37 37
CNF_3 CNF_2
AT1G01010 816 912
AT1G01020 223 312
AT1G01030 15 19
[ reached getOption("max.print") -- omitted 3 rows ]
Sample:
group norm.factors lib.sizes
cNF_3 At_AmbientCO2_HighNitrate_Fe 1 32032772
cNF_2 At_AmbientCO2_HighNitrate_Fe 1 25708325
cNF_1 At_AmbientCO2_HighNitrate_Fe 1 25388926
cnF_2 At_AmbientCO2_LowNitrate_Fe 1 29410725
cnF_1 At_AmbientCO2_LowNitrate_Fe 1 26720600
cnF_3 At_AmbientCO2_LowNitrate_Fe 1 26429519
CNF_1 At_ElevatedCO2_HighNitrate_Fe 1 16944203
CnF_2 At_ElevatedCO2_LowNitrate_Fe 1 28518287
CnF_1 At_ElevatedCO2_LowNitrate_Fe 1 30134658
CnF_3 At_ElevatedCO2_LowNitrate_Fe 1 30939161
cNf_1 At_AmbientCO2_HighNitrate_FeStarvation 1 30097179
cnf_2 At_AmbientCO2_LowNitrate_FeStarvation 1 29411029
cnf_1 At_AmbientCO2_LowNitrate_FeStarvation 1 32417299
cNf_2 At_AmbientCO2_HighNitrate_FeStarvation 1 34497190
cNf_3 At_AmbientCO2_HighNitrate_FeStarvation 1 31947050
cnf_3 At_AmbientCO2_LowNitrate_FeStarvation 1 30166724
Cnf_3 At_ElevatedCO2_LowNitrate_FeStarvation 1 33780095
CNf_1 At_ElevatedCO2_HighNitrate_FeStarvation 1 30427501
Cnf_1 At_ElevatedCO2_LowNitrate_FeStarvation 1 29577130
Cnf_2 At_ElevatedCO2_LowNitrate_FeStarvation 1 30405383
CNf_3 At_ElevatedCO2_HighNitrate_FeStarvation 1 25049240
CNf_2 At_ElevatedCO2_HighNitrate_FeStarvation 1 28496520
CNF_3 At_ElevatedCO2_HighNitrate_Fe 1 18823627
CNF_2 At_ElevatedCO2_HighNitrate_Fe 1 20700711
tcc <- filterLowCountGenes(tcc, low.count = 10)
# Normalisation
tcc <- calcNormFactors(tcc, norm.method = "tmm", test.method = "edger", iteration = 1,
FDR = 0.1, floorPDEG = 0.05)
tcc$norm.factors cNF_3 cNF_2 cNF_1 cnF_2 cnF_1 cnF_3 CNF_1 CnF_2
0.9662816 0.9730845 0.8976118 0.9973121 1.0475825 1.0072335 0.9752235 1.0522596
CnF_1 CnF_3 cNf_1 cnf_2 cnf_1 cNf_2 cNf_3 cnf_3
1.0186883 1.0510472 1.0019289 0.9977532 1.0188031 1.0400636 0.9948079 1.0087956
Cnf_3 CNf_1 Cnf_1 Cnf_2 CNf_3 CNf_2 CNF_3 CNF_2
0.9381444 0.9961312 1.0067383 1.0059904 1.0542699 1.0349814 0.9249584 0.9903091
user system elapsed
10.745 0.695 11.456
s <- sample(rownames(tcc$count), size = 200)
heatmap(as.matrix(tcc$count[s, ]), main = "Before normalisation")normalized.count <- getNormalizedData(tcc)
heatmap(as.matrix(normalized.count[s, ]), main = "After normalisation")# DEtest
tcc <- estimateDE(tcc, test.method = "edger", FDR = 0.001)
result <- getResult(tcc, sort = TRUE)
DEgenes <- subset(result, estimatedDEG == 1)
print(paste(dim(DEgenes)[1], " genes DE"))[1] "13655 genes DE"
gene_id a.value m.value p.value q.value rank estimatedDEG
1 AT1G02850 NA NA 0 0 45 1
2 AT1G03870 NA NA 0 0 45 1
3 AT1G05660 NA NA 0 0 45 1
4 AT1G06120 NA NA 0 0 45 1
5 AT1G09932 NA NA 0 0 45 1
6 AT1G12030 NA NA 0 0 45 1
plot(tcc, group = c("At_AmbientCO2_HighNitrate_Fe", "At_ElevatedCO2_HighNitrate_Fe"),
main = "Effet CO2 en conditions normales", ylim = c(-11, 11))plot(tcc, group = c("At_AmbientCO2_HighNitrate_FeStarvation", "At_ElevatedCO2_HighNitrate_FeStarvation"),
main = "Effet CO2 en Fe starvation", ylim = c(-11, 11))plot(tcc, group = c("At_AmbientCO2_LowNitrate_Fe", "At_ElevatedCO2_LowNitrate_Fe"),
main = "Effet CO2 en low Nitrate", ylim = c(-11, 11))plot(tcc, group = c("At_AmbientCO2_HighNitrate_Fe", "At_AmbientCO2_HighNitrate_FeStarvation"),
main = "Effet Fe en conditions normales", ylim = c(-11, 11))plot(tcc, group = c("At_AmbientCO2_LowNitrate_FeStarvation", "At_AmbientCO2_HighNitrate_FeStarvation"),
main = "Effet Nitrate en Fe starvation", ylim = c(-11, 11))# ggplot(melted_res, aes(Var2, Var1, fill= log(value))) + geom_tile() +
# scale_fill_distiller(palette = 'RdPu') + theme(axis.text.x = element_text(size
# = 0.5, angle = 320, hjust = 0, colour = 'grey50')) library(viridis) heatmap <-
# ggplot(melted_res, aes(Var2, Var1, fill= log(value))) + geom_raster() +
# scale_fill_viridis(trans='sqrt') + theme(axis.text.x=element_text(angle=65,
# hjust=1), axis.text.y=element_blank(), axis.ticks.y=element_blank()) heatmapIl semble que l’effet du fer soit le plus fort, et celui qui amplifie les autres effets. La technique utilisée ici identifie des DEG globalement, sans séparer lesquels sont dûs à quel effet.
On utilise des modèles de mélange pour regrouper les gènes ayant des profils d’expression similaires dans les différentes conditions.
AT1G02850 AT1G03870 AT1G05660 AT1G06120 AT1G09930 AT1G12000 AT1G12050 AT1G12150
10061 89198 6423 10592 3134 131939 30686 1084
AT1G13480 AT1G13590 AT1G17145 AT1G17150 AT1G26270 AT1G27050 AT1G43690 AT1G51710
43476 17469 20056 18 74275 6989 53707 85286
AT1G51940 AT1G51950 AT1G52760 AT1G56250 AT1G65410 AT1G65870 AT1G69640 AT1G73030
7530 17490 142691 2919 10285 39 35266 69999
AT1G74400 AT2G03710 AT2G13560 AT2G18350 AT2G21260 AT2G26290 AT2G29100 AT2G29730
680 249 171475 4716 140 23798 1005 20918
AT2G30360 AT2G31910 AT2G38680 AT2G43220 AT3G03272 AT3G22420 AT3G26450 AT3G27900
957 220 5616 3733 84 1695 9460 289
AT3G44550 AT3G44630 AT3G51360 AT3G52970 AT3G55485 AT3G56460 AT4G01335 AT4G11830
76049 9287 21040 207 146 47385 206 87385
AT4G11850 AT4G11860 AT4G14790 AT4G15830 AT4G16515 AT4G21105 AT4G21850 AT4G24590
144744 35429 13142 8702 21 97849 81045 25035
AT4G31360 AT4G32200 AT4G32340 AT4G32915 AT4G33160 AT5G04560 AT5G06100 AT5G10560
5780 5405 609 8515 6540 70613 16947 35020
AT5G12330 AT5G13080 AT5G13090 AT5G16990 AT5G19560 AT5G23360 AT5G26670 AT5G27460
32944 15900 11540 3828 5193 2763 13725 9187
AT5G37980 AT5G38200 AT5G38540
1072 8758 19441
[ reached getOption("max.print") -- omitted 13580 entries ]
conds <- str_split_fixed(colnames(data), "_", 2)[, 1]
run_pois <- coseq(data, conds = conds, K = 4:12, model = "Poisson", iter = 5, transformation = "none")****************************************
coseq analysis: Poisson approach & none transformation
K = 4 to 12
Use set.seed() prior to running coseq for reproducible results.
****************************************
Running g = 4 ...
[1] "Initialization: 1"
[1] "Log-like diff: 6383.35560997041"
[1] "Log-like diff: 2541.07674853667"
[1] "Log-like diff: 3138.40011106496"
[1] "Log-like diff: 678.272267739288"
[1] "Log-like diff: 1066.10599923997"
Running g = 5 ...
[1] "Initialization: 1"
[1] "Log-like diff: 5400.25029148319"
[1] "Log-like diff: 5241.9010062319"
[1] "Log-like diff: 4685.38879778667"
[1] "Log-like diff: 3580.50546012803"
[1] "Log-like diff: 4811.55846886488"
Running g = 6 ...
[1] "Initialization: 1"
[1] "Log-like diff: 1140.31426327848"
[1] "Log-like diff: 411.787293101512"
[1] "Log-like diff: 851.360266454866"
[1] "Log-like diff: 1793.95281200853"
[1] "Log-like diff: 3187.63748669865"
Running g = 7 ...
[1] "Initialization: 1"
[1] "Log-like diff: 2683.8750449631"
[1] "Log-like diff: 1201.68394211084"
[1] "Log-like diff: 405.376369060642"
[1] "Log-like diff: 447.164055201859"
[1] "Log-like diff: 309.386561233669"
Running g = 8 ...
[1] "Initialization: 1"
[1] "Log-like diff: 1157.93080280741"
[1] "Log-like diff: 399.753483835894"
[1] "Log-like diff: 517.710068453152"
[1] "Log-like diff: 376.842390858619"
[1] "Log-like diff: 307.473272700884"
Running g = 9 ...
[1] "Initialization: 1"
[1] "Log-like diff: 427.711994224773"
[1] "Log-like diff: 104.220355519937"
[1] "Log-like diff: 346.890544544054"
[1] "Log-like diff: 275.908685445025"
[1] "Log-like diff: 90.9370312869343"
Running g = 10 ...
[1] "Initialization: 1"
[1] "Log-like diff: 798.022501318409"
[1] "Log-like diff: 329.548169363579"
[1] "Log-like diff: 18.6276446152134"
[1] "Log-like diff: 217.089872419333"
[1] "Log-like diff: 402.178479307672"
Running g = 11 ...
[1] "Initialization: 1"
[1] "Log-like diff: 1748.3185841292"
[1] "Log-like diff: 2485.73678952775"
[1] "Log-like diff: 862.45351058019"
[1] "Log-like diff: 999.852019477722"
[1] "Log-like diff: 3389.9291006394"
Running g = 12 ...
[1] "Initialization: 1"
[1] "Log-like diff: 951.255640789051"
[1] "Log-like diff: 849.651933918463"
[1] "Log-like diff: 918.754379816341"
[1] "Log-like diff: 1429.19576522524"
[1] "Log-like diff: 2279.99055144227"
$logLike
$ICL
$profiles
$boxplots
$probapost_boxplots
$probapost_barplots
$probapost_histogram
*************************************************
Model: Poisson
Transformation: none
*************************************************
Clusters fit: 4,5,6,7,8,9,10,11,12
Clusters with errors: ---
Selected number of clusters via ICL: 12
ICL of selected model: -9323707
*************************************************
Number of clusters = 12
ICL = -9323707
*************************************************
Cluster sizes:
Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7
403 962 360 195 3735 4105 1168
Cluster 8 Cluster 9 Cluster 10 Cluster 11 Cluster 12
1005 102 486 788 346
Number of observations with MAP > 0.90 (% of total):
11829 (86.63%)
Number of observations with MAP > 0.90 per cluster (% of total per cluster):
Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7
349 849 265 160 3275 3591 1058
(86.6%) (88.25%) (73.61%) (82.05%) (87.68%) (87.48%) (90.58%)
Cluster 8 Cluster 9 Cluster 10 Cluster 11 Cluster 12
842 76 408 671 285
(83.78%) (74.51%) (83.95%) (85.15%) (82.37%)
On représente le clustering dans la plan principal d’une ACP
suppressMessages(library(ade4, warn.conflicts = F, quietly = T))
suppressMessages(library(adegraphics, warn.conflicts = F, quietly = T))
acp <- dudi.pca(log(data[, colnames(data) != "cluster"] + 0.1), center = TRUE, scale = TRUE,
scannf = FALSE, nf = 4)
summary(acp)Class: pca dudi
Call: dudi.pca(df = log(data[, colnames(data) != "cluster"] + 0.1),
center = TRUE, scale = TRUE, scannf = FALSE, nf = 4)
Total inertia: 24
Eigenvalues:
Ax1 Ax2 Ax3 Ax4 Ax5
22.60009 0.58171 0.12401 0.07288 0.04945
Projected inertia (%):
Ax1 Ax2 Ax3 Ax4 Ax5
94.1671 2.4238 0.5167 0.3037 0.2061
Cumulative projected inertia (%):
Ax1 Ax1:2 Ax1:3 Ax1:4 Ax1:5
94.17 96.59 97.11 97.41 97.62
(Only 5 dimensions (out of 24) are shown)
library("RColorBrewer")
data$cluster = clusters_per_genes[as.vector(rownames(data))]
s.corcircle(acp$co, xax = 1, yax = 2, fullcircle = FALSE, pback.col = "lightgrey")adegraphics::s.class(acp$li, xax = 1, yax = 2, as.factor(data$cluster), labels = as.character(levels(as.factor(data$cluster))),
col = brewer.pal(n = 10, name = "Paired"), chullSize = 1, ellipseSize = 0, plabels.cex = 0.7,
pbackground.col = "grey85", main = "Clusters dans le plan principal", ylim = c(-9,
9))adegraphics::s.class(acp$li, xax = 2, yax = 3, as.factor(data$cluster), labels = as.character(levels(as.factor(data$cluster))),
col = brewer.pal(n = 10, name = "Paired"), chullSize = 1, ellipseSize = 0, plabels.cex = 0.7,
pbackground.col = "grey85", main = "Clusters dans le plan principal", ylim = c(-9,
9))adegraphics::s.class(acp$li, xax = 2, yax = 3, as.factor(data$cluster), labels = as.character(levels(as.factor(data$cluster))),
col = brewer.pal(n = 10, name = "Paired"), chullSize = 1, ellipseSize = 0, plabels.cex = 0.7,
pbackground.col = "grey85", main = "Clusters dans le plan principal", ylim = c(-9,
9))adegraphics::s.class(acp$li, xax = 4, yax = 2, as.factor(data$cluster), labels = as.character(levels(as.factor(data$cluster))),
col = brewer.pal(n = 10, name = "Paired"), chullSize = 1, ellipseSize = 0, plabels.cex = 0.7,
pbackground.col = "grey85", main = "Clusters dans le plan principal", ylim = c(-9,
9))adegraphics::s.class(acp$li, xax = 4, yax = 3, as.factor(data$cluster), labels = as.character(levels(as.factor(data$cluster))),
col = brewer.pal(n = 10, name = "Paired"), chullSize = 1, ellipseSize = 0, plabels.cex = 0.7,
pbackground.col = "grey85", main = "Clusters dans le plan principal", ylim = c(-9,
9)) Il semblerait que l’ACP détecte dans un premier temps (avec le premier vecteur principal) la valeur moyenne d’expression, puis l’expression diff induite par le fer.
On peut faire des représentations dans le plan des seconds et troisième axes principaux (le premier traduit le fer, le second la carence en nitrate).
Sur le cercle de corrélations dans le plan 2 et 3, on voit bien que lorsque la carence fer est là, on peut différencier un effet nitrate. Le CO2 est, lui encore difficile à identifier, le 4eme axe de l’ACP ne renseignant pas beaucoup…
dualDE <- function(labels, pval = 0.01) {
# data
d <- data[, grepl(labels[1], colnames(data)) | grepl(labels[2], colnames(data))]
y <- DGEList(counts = d, group = str_split_fixed(colnames(d), "_", 2)[, 1])
y$samples
# filtre
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]
# normalisation
y <- calcNormFactors(y)
y$samples
norm <- cpm(y, normalized.lib.sizes = TRUE)
not_norm <- cpm(y, normalized.lib.sizes = FALSE)
rd_genes = sample(rownames(norm), 500)
heatmap(not_norm[rd_genes, ], main = "Sans normalisation")
heatmap(norm[rd_genes, ], main = "Avec normalisation")
# estimation of dispersion and tests
y <- edgeR::estimateDisp(y)
et <- exactTest(y)
best = topTags(et, n = 10000)
best = best[best$table$FDR < pval, ]
# best = best[abs(best$table$logFC) > 1,]
# genes <- rownames(cpm(y, normalized.lib.sizes=TRUE))
DEgenes = rownames(best$table)
plotBCV(y)
plotMD(et, p.value = pval)
abline(h = c(-1, 1), col = "blue")
abline(h = c(0), col = "red")
print(summary(decideTests(et, p.value = pval)))
# plots
plotMDS(y, top = 100, col = rep(1:2, each = 3))
logcpm <- cpm(y, log = TRUE)
heatmap(logcpm[DEgenes, ])
print("ON SELECTIONNE AU TOTAL :")
print(paste(length(DEgenes), "genes with fdr < ", pval))
}
labels <- c("cnF", "cNF")
dualDE(labels, pval = 0.01)Design matrix not provided. Switch to the classic mode.
cNF-cnF
Down 472
NotSig 10019
Up 1237
[1] "ON SELECTIONNE AU TOTAL :"
[1] "1709 genes with fdr < 0.01"
On retrouve une majorité de gènes activés par le fort nitrate. Les résultats sont un peu similaires à ceux du jeu de données d’entrainement de Nature entre témoin au KCl et Traitement au NO3.
Design matrix not provided. Switch to the classic mode.
CNF-cNF
Down 25
NotSig 11597
Up 10
[1] "ON SELECTIONNE AU TOTAL :"
[1] "35 genes with fdr < 0.01"
Design matrix not provided. Switch to the classic mode.
CnF-cnF
Down 404
NotSig 10878
Up 491
[1] "ON SELECTIONNE AU TOTAL :"
[1] "895 genes with fdr < 0.01"
Design matrix not provided. Switch to the classic mode.
CNf-cNf
Down 646
NotSig 10560
Up 736
[1] "ON SELECTIONNE AU TOTAL :"
[1] "1382 genes with fdr < 0.01"
Design matrix not provided. Switch to the classic mode.
Cnf-cnf
Down 375
NotSig 11004
Up 587
[1] "ON SELECTIONNE AU TOTAL :"
[1] "962 genes with fdr < 0.01"
Design matrix not provided. Switch to the classic mode.
cNF-cNf
Down 2125
NotSig 7515
Up 2329
[1] "ON SELECTIONNE AU TOTAL :"
[1] "4454 genes with fdr < 0.01"
A faire maybe soon